10/13/2020

Agenda

  • Goals of model selection and regularization
  • Subset selection
  • Ridge regression
  • Lasso regression

Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).

Model Building Process

With a set of variables in hand, the goal is to select the best model. Why not include all the variables?

Big models tend to over-fit: they pick up patterns that are specific to the data at hand, i.e. relationships that do not generalize.

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn.

We need model-building strategies that account for the trade-off between bias and variance: subset selection, shrinkage, and dimension reduction.

Best subset selection: example
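
A minimal sketch of how such an example could be run in R, assuming the leaps package and a hypothetical data frame df containing the response y and the candidate predictors:

    # df, y: hypothetical data frame and response used for illustration
    # best subset selection: fit the best model of each size, up to nvmax predictors
    library(leaps)
    fit <- regsubsets(y ~ ., data = df, nvmax = 10)
    summ <- summary(fit)
    summ$which           # which predictors enter the best model of each size
    which.min(summ$bic)  # model size preferred by BIC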

Shrinkage methods

The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.

  • As an alternative, we can work with all \(p\) predictors, but at the same time shrink the coefficients towards zero relative to the least squares estimates
  • This shrinkage (also known as regularization) has the effect of reducing variance (at the cost of some bias) and, for some penalties, of performing variable selection
  • The hope is that by having the shrinkage in place, the estimation procedure will be able to focus on the “important” \(\beta_j\)’s

Ridge regression

Ridge regression

Recall that the least squares fitting procedure estimates \(\beta_0, \dots, \beta_p\) using the values that minimize

\[RSS = \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2\]

Ridge Regression is a modification of the least squares criteria that minimizes

\[\underbrace{\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2}_\text{traditional objective function of LS} + \underbrace{\lambda \sum_{j=1}^p \beta_j^2}_\text{shrinkage penalty} = RSS + \lambda \sum_{j=1}^p \beta_j^2\]

where \(\lambda \geq 0\) is a tuning parameter, to be determined separately.
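
As a concrete illustration, here is a minimal R sketch of this criterion, written as a function of hypothetical inputs X (an \(n \times p\) predictor matrix), y, beta0, beta and lambda:

    # X, y, beta0, beta, lambda: hypothetical inputs used only for illustration
    # ridge criterion: RSS plus the squared-l2 shrinkage penalty
    ridge_criterion <- function(beta0, beta, X, y, lambda) {
      rss <- sum((y - beta0 - X %*% beta)^2)
      rss + lambda * sum(beta^2)  # note: the intercept beta0 is not penalized
    }

Ridge regression chooses the \(\beta\)'s that make this quantity as small as possible for the given \(\lambda\).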

Ridge regression

\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

  • As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small
  • However, the second term \(\lambda \sum_{j=1}^p \beta_j^2\) (shrinkage penalty) favors \(\beta_1, \dots, \beta_p\) that are close to zero, and so it has the effect of shrinking the estimates of \(\beta_j\) towards zero
  • Careful: \(\beta_0\) does not get penalized!

Ridge regression

\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2\]

  • So, for different values of \(\lambda\) we get different solutions to the problem: use cross-validation!
    • when \(\lambda = 0\) we are back to least squares
    • when \(\lambda \rightarrow +\infty\), it is “too expensive” to allow for any \(\beta_{j}\) to be far from \(0\)

Solution path
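
A minimal sketch of how such a solution path can be traced out, assuming the glmnet package and hypothetical objects X (predictor matrix) and y (response):

    # X, y: hypothetical predictor matrix and response
    library(glmnet)
    grid <- 10^seq(4, -2, length.out = 100)              # grid of lambda values
    ridge_fit <- glmnet(X, y, alpha = 0, lambda = grid)  # alpha = 0 gives ridge
    plot(ridge_fit, xvar = "lambda")                     # coefficients as a function of log(lambda)
    coef(ridge_fit, s = 0.1)                             # estimates at lambda = 0.1

As \(\lambda\) increases, all the coefficient estimates are shrunk towards zero, but (unlike the lasso) they typically never become exactly zero.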

Bias-variance tradeoff

Simulated data with \(n = 50\) observations and \(p = 45\) predictors, all having nonzero coefficients. Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of \(\lambda\). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
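
A minimal sketch of how a test-MSE curve of this kind could be computed, assuming glmnet and hypothetical training/test splits X_train, y_train, X_test, y_test:

    # X_train, y_train, X_test, y_test: hypothetical train/test splits
    library(glmnet)
    grid <- 10^seq(4, -2, length.out = 100)
    fit <- glmnet(X_train, y_train, alpha = 0, lambda = grid)
    pred <- predict(fit, newx = X_test)              # one column of predictions per lambda
    test_mse <- colMeans((y_test - pred)^2)          # test MSE as a function of lambda
    fit$lambda[which.min(test_mse)]                  # lambda with the smallest test MSE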

Standardizing the variables

  • The OLS solution is scale equivariant: multiplying \(X_j\) by a constant \(c\) simply scales the corresponding least squares coefficient estimate \(\hat{\beta}_j\) by a factor of \(1/c\), so that \(X_j \hat{\beta}_j\) is unchanged
  • Ridge regression coefficients can change substantially when multiplying a given predictor by a constant, due to the penalty term
  • Therefore, it is best to apply ridge regression after standardizing the predictors \[\tilde{x}_{ij} = \dfrac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^n (x_{ij} - \overline{x}_j)^2}}\]
  • This is achieved using scale() in R
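
A minimal sketch (hypothetical predictor matrix X); note that scale() uses the sample standard deviation (with an \(n - 1\) denominator), which differs from the \(1/n\) version above only by a constant factor, and that glmnet() already standardizes internally by default (standardize = TRUE):

    # X: hypothetical predictor matrix
    # center each column of X and rescale it to unit standard deviation
    X_std <- scale(X, center = TRUE, scale = TRUE)
    colMeans(X_std)       # approximately 0 for every column
    apply(X_std, 2, sd)   # 1 for every column (up to floating point)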

Lasso regression

Lasso regression

Ridge regression does have an obvious disadvantage: it does not perform variable selection and it includes all \(p\) predictors in the final model.

Lasso Regression is a modification of the least squares criteria that minimizes

\[\underbrace{\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2}_\text{traditional objective function of LS} + \underbrace{\lambda \sum_{j=1}^p |\beta_j|}_\text{shrinkage penalty} = RSS + \lambda \sum_{j=1}^p |\beta_j|\]

The Lasso uses an \(\ell_1\) penalty instead of an \(\ell_2\) penalty. The \(\ell_1\) norm of a coefficient vector \(\mathbf{\beta}\) is given by \(||\mathbf{\beta}||_1 = \sum_{j=1}^p |\beta_j|\).

Lasso regression

  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero
  • However, in the case of the lasso, the \(\ell_1\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large: variable selection
  • We say that the lasso yields sparse models - that is, models that involve only a subset of the variables
  • As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical (cross-validation)

Solution path
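
A minimal sketch of a lasso solution path (again assuming glmnet and hypothetical X, y); unlike ridge, the coefficient paths hit exactly zero as \(\lambda\) grows:

    # X, y: hypothetical predictor matrix and response
    library(glmnet)
    lasso_fit <- glmnet(X, y, family = "gaussian", alpha = 1)  # alpha = 1 gives the lasso
    plot(lasso_fit, xvar = "lambda")          # paths drop to zero one by one
    sum(coef(lasso_fit, s = 0.1)[-1] == 0)    # number of predictors excluded at lambda = 0.1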

Lasso for Variable Selection

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero? One can show that the lasso and ridge regression coefficient estimates solve the constrained optimization problems

\[ \begin{aligned} &\text{Ridge: } \text{argmin}_{\mathbf{\beta}} \ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 \quad \text{subject to } \sum_{j=1}^p \beta_j^2 \leq s \\ &\text{Lasso: } \text{argmin}_{\mathbf{\beta}} \ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\right)^2 \quad \text{subject to } \sum_{j=1}^p |\beta_j| \leq s \end{aligned} \]

Lasso for Variable Selection

Geometrically, the \(\ell_1\) constraint region \(\sum_{j=1}^p |\beta_j| \leq s\) is a diamond-shaped set with corners on the coordinate axes, so the RSS contours often first touch it at a corner, where one or more coefficients are exactly zero; the \(\ell_2\) ball \(\sum_{j=1}^p \beta_j^2 \leq s\) has no corners, so ridge estimates are shrunk but rarely exactly zero.

Which one is better?

It depends!

  • In general, Lasso will perform better than Ridge when a relatively small number of predictors have a strong effect on \(Y\), while Ridge will do better when \(Y\) is a function of many of the \(X\)’s and the coefficients are of moderate size
  • Lasso can be easier to interpret (the zeros help)
  • If prediction is what we care about, the only way to decide which method is better is to compare their out-of-sample performance

Choosing \(\lambda\)

The idea is to solve the Ridge or Lasso problem over a grid of possible values for \(\lambda\) and to compute the cross-validation error for each value of \(\lambda\). We then select the value of \(\lambda\) for which the cross-validation error is smallest.

Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
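
A minimal sketch of this procedure with cv.glmnet (assuming glmnet and hypothetical X, y); running the same code with alpha = 0 tunes ridge instead, so the two methods can also be compared on their cross-validated error:

    # X, y: hypothetical predictor matrix and response
    library(glmnet)
    set.seed(1)                                          # CV folds are random
    cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # 10-fold CV over a grid of lambda
    plot(cv_lasso)                                       # CV error as a function of log(lambda)
    best_lambda <- cv_lasso$lambda.min                   # lambda with the smallest CV error
    final_fit <- glmnet(X, y, alpha = 1, lambda = best_lambda)  # re-fit on all observations
    coef(final_fit)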

Choosing \(\lambda\)

Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Slides 9 and 17. Right: The corresponding lasso coefficient estimates are displayed.

Elastic Net

Both Ridge and Lasso regression can be seen as particular cases of Elastic Net regression, which minimizes

\[\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \{ (1 - \alpha) \beta_j^2 + \alpha |\beta_j|\}\]

The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge.

Lasso: glmnet(X, y, family = "gaussian", alpha = 1)

Ridge: glmnet(X, y, family = "gaussian", alpha = 0)
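
Elastic net (e.g. \(\alpha = 0.5\)): glmnet(X, y, family = "gaussian", alpha = 0.5)

In practice \(\alpha\) can be chosen by comparing the cross-validated error (cv.glmnet) across a few candidate values, with \(\lambda\) then tuned for the chosen \(\alpha\). Note that glmnet parameterizes the elastic-net penalty slightly differently (it places a factor of \(1/2\) on the ridge part), but the resulting family of fits is essentially the same.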

Model Selection and Regularization

Summary

  • Both regularization techniques and model selection techniques are active fields of research
  • In high dimensional problems, performing variable selection is the preferred approach
  • All that we saw is still valid for logistic regression
  • Use cross-validation to choose the optimal model parameters and also the optimal model

Question time